Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JIT] Add support to inline the field access of primitive types marked with TLS #82973

Merged
merged 93 commits into from
Apr 11, 2023

Conversation

kunalspathak
Copy link
Member

@kunalspathak kunalspathak commented Mar 4, 2023

Summary

This is an early prototype to add support of inlining ThreadStatic (TS) field access (for now just the primitive types. Today, for fields that are marked with ThreadStatic always has to go through the helper. The helper first gets the current thread, the thread local block, then the ThreadLocalModule for the current moduleIndex and lastly the static data block. This prototype tries to inline such accesses by adding few data structures that acts as a cache.

A thread local cache t_threadStaticBlocks (array of pointers) is added which stores the static blocks corresponding to the given thread.

During codegen, the enclosing type of the TS field that is being access (load or store) is monitored and gets a unique index. This index is the position at which the static data block will be stored in t_threadStaticBlocks cache at runtime. Since the index is assigned during codegen and embedded in the code, it remains same for any thread that executes the code at runtime. Hence any thread that executes the code, will make sure that it gets the relevant static data block from t_threadStaticBlocks cache.

The t_threadStaticBlocks is populated during runtime as well. The first time the field access code is executed, it tries to find the static data block in t_threadStaticBlocks but doesn't find it. It fallbacks to the slow path which is the existing helper call. The helper call has been modified to update the t_threadStaticBlocks with the static data block. Next time when the TS field access code is executed, the entry is found in the cache, and we skip going to the helpers.

To access the t_threadStaticBlocks at runtime, code is generated to access the relevant cache for the current thread by fetching the TLS of current thread, getting the slot for runtime, and then getting the t_threadStaticBlocks present in that slot.

Disassembly

To understand how it works, consider the following C# code.

[MethodImpl(MethodImplOptions.NoInlining)]
public int GetThreadStaticInt() => t_threadStaticIntValue;

[ThreadStatic]
private static int t_threadStaticIntValue = 0;

Today, we always generate a helper call to retrieve the t_threadStaticIntValue field value.

G_M000_IG01:                ;; offset=0000H
       4883EC28             sub      rsp, 40
 
G_M000_IG02:                ;; offset=0004H
       48B9D0482999FA7F0000 mov      rcx, 0x7FFA992948D0
       BA78010000           mov      edx, 376
       E81894A65F           call     CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR
       8B80A4040000         mov      eax, dword ptr [rax+04A4H]
 
G_M000_IG03:                ;; offset=001EH
       4883C428             add      rsp, 40
       C3                   ret  

With this prototype, this is what we would generate:

G_M10050_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40

G_M10050_IG02:              ;; offset=0004H
       65488B042558000000   mov      rax, qword ptr GS:[0x0058]     ; Access the TLS of current thread
       488B4030             mov      rax, qword ptr [rax+30H]       ; Get the runtime TLS slot from TLS[_tls_index]
       4883784802           cmp      qword ptr [rax+48H], 2         ; See if length of `t_threadStaticBlocks` < typeIndex. Here `typeIndex == 2`.
       7F15                 jg       SHORT G_M10050_IG05            ; If yes, then proceed, else fallback to the helper

G_M10050_IG03:              ;; offset=0018H
       488B4050             mov      rax, qword ptr [rax+50H]       ; Get the `t_threadStaticBlocks`.
       488B4010             mov      rax, qword ptr [rax+10H]       ; Get the `t_threadStaticBlocks[typeIndex]`.
       4885C0               test     rax, rax                       ; Check if a valid entry is present at `t_threadStaticBlocks[typeIndex]`
       7408                 je       SHORT G_M10050_IG05            ; If invalid, then go to the helper
       8B4038               mov      eax, dword ptr [rax+38H]       ; If valid, then get the field value.

G_M10050_IG04:              ;; offset=0028H
       4883C428             add      rsp, 40
       C3                   ret

G_M10050_IG05:              ;; offset=002DH
       48B990296324FA7F0000 mov      rcx, 0x7FFA24632990
       BA0D000000           mov      edx, 13
       41B802000000         mov      r8d, 2                          ; This is a new paramter to the helper that would cache static data block in `t_threadStaticBlocks` at index `2`.
       E849D8545F           call     CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR
       8B4038               mov      eax, dword ptr [rax+38H]
       EBDC                 jmp      SHORT G_M10050_IG04

Details

TODO

TODO

  • Rename bunch of variables with relavent names
  • Add support for dynamic memory allocation for the t_threadStaticBlocks.
  • Add a new helper for TLS access. Currently the existing helper is being used with an added parameter for the typeIndex.
  • Mark the block containing helper call as rarely run, so it is moved to the bottom of the method.
  • Extract the common code in importer to a method.
  • Performance measurements
  • Hoisting some checks out of loop. Should the expansion happen in importer or some later phase?

Contributes to #79521.

@ghost ghost assigned kunalspathak Mar 4, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 4, 2023
@ghost
Copy link

ghost commented Mar 4, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

Summary

This is an early prototype to add support of inlining ThreadStatic (TS) field access (for now just the primitive types. Today, for fields that are marked with ThreadStatic always has to go through the helper. The helper first gets the current thread, the thread local block, then the ThreadLocalModule for the current moduleIndex and lastly the static data block. This prototype tries to inline such accesses by adding few data structures that acts as a cache.

A thread local cache t_threadStaticBlocks (array of pointers) is added which stores the static blocks corresponding to the given thread.

During codegen, the enclosing type of the TS field that is being access (load or store) is monitored and gets a unique index. This index is the position at which the static data block will be stored in t_threadStaticBlocks cache at runtime. Since the index is assigned during codegen and embedded in the code, it remains same for any thread that executes the code at runtime. Hence any thread that executes the code, will make sure that it gets the relevant static data block from t_threadStaticBlocks cache.

The t_threadStaticBlocks is populated during runtime as well. The first time the field access code is executed, it tries to find the static data block in t_threadStaticBlocks but doesn't find it. It fallbacks to the slow path which is the existing helper call. The helper call has been modified to update the t_threadStaticBlocks with the static data block. Next time when the TS field access code is executed, the entry is found in the cache, and we skip going to the helpers.

To access the t_threadStaticBlocks at runtime, code is generated to access the relevant cache for the current thread by fetching the TLS of current thread, getting the slot for runtime, and then getting the t_threadStaticBlocks present in that slot.

Disassembly

To understand how it works, consider the following C# code.

[MethodImpl(MethodImplOptions.NoInlining)]
public int GetThreadStaticInt() => t_threadStaticIntValue;

[ThreadStatic]
private static int t_threadStaticIntValue = 0;

Today, we always generate a helper call to retrieve the t_threadStaticIntValue field value.

G_M000_IG01:                ;; offset=0000H
       4883EC28             sub      rsp, 40
 
G_M000_IG02:                ;; offset=0004H
       48B9D0482999FA7F0000 mov      rcx, 0x7FFA992948D0
       BA78010000           mov      edx, 376
       E81894A65F           call     CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR
       8B80A4040000         mov      eax, dword ptr [rax+04A4H]
 
G_M000_IG03:                ;; offset=001EH
       4883C428             add      rsp, 40
       C3                   ret  

With this prototype, this is what we would generate:

G_M000_IG01:                ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
 
G_M000_IG02:                ;; offset=0005H
       65488B0C2558000000   mov      rcx, qword ptr GS:[0x0058]           ; Access the TLS of current thread
       488B7130             mov      rsi, qword ptr [rcx+30H]             ; Get the runtime TLS slot from TLS[_tls_index]
       4883BEB801000002     cmp      qword ptr [rsi+1B8H], 2              ; See if length of `t_threadStaticBlocks` > typeIndex. Here `typeIndex == 2`.
       7E22                 jle      SHORT G_M000_IG04                    ; If yes, then proceed, else fallback to the helper
 
G_M000_IG03:                ;; offset=001CH
       48B9D048F522FA7F0000 mov      rcx, 0x7FFA22F548D0
       BA78010000           mov      edx, 376
       41B802000000         mov      r8d, 2                               ; This is a new paramter to the helper that would cache static data block in `t_threadStaticBlocks` at index `2`.
       E8FAF8AB5F           call     CORINFO_HELP_GETSHARED_NONGCTHREADSTATIC_BASE_NOCTOR
       8B80A4040000         mov      eax, dword ptr [rax+04A4H]
       EB16                 jmp      SHORT G_M000_IG06
 
G_M000_IG04:                ;; offset=003EH
       488B86B0010000       mov      rax, qword ptr [rsi+1B0H]             ; Get the `t_threadStaticBlocks[typeIndex]`.
       488B4010             mov      rax, qword ptr [rax+10H]              ; Get the `t_threadStaticBlocks[typeIndex]`.
       4885C0               test     rax, rax                              ; Check if a valid entry is present at `t_threadStaticBlocks[typeIndex]`.         
       74CE                 je       SHORT G_M000_IG03                     ; If invalid, then go to the helper
 
G_M000_IG05:                ;; offset=004EH
       8B80A4040000         mov      eax, dword ptr [rax+04A4H]            ; If valid, then get the field value.
 
G_M000_IG06:                ;; offset=0054H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret     

Details

TODO

TODO

  • Rename bunch of variables with relavent names
  • Add support for dynamic memory allocation for the t_threadStaticBlocks.
  • Add a new helper for TLS access. Currently the existing helper is being used with an added parameter for the typeIndex.
  • Mark the block containing helper call as rarely run, so it is moved to the bottom of the method.
  • Performance measurements

Contributes to #79521.

Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -

@tannergooding
Copy link
Member

tannergooding commented Mar 4, 2023

@kunalspathak, can you also give an example of the use of a TLS in a for loop so we can see what's getting hoisted (one time cost) vs what is repeated cost?

For example, what is the codegen given the following:

    [ThreadStatic]
    public static volatile int t_value;
    
    public static int Test(uint count)
    {
        int sum = 0;
        
        for (uint i = 0; i < count; i++)
        {
            sum += t_value;
        }
        
        return sum;
    }

I'd expect we end up with the large up front block as "hoistable", that is the initial resolution of the TLS base. The inner loop should then remain "small" and effectively just a direct memory access since we'll have already resolved the base/offset of the TLS value for the given thread.

Notably it also looks like there is a "cheap check" and "expensive fallback" to the TLS handling here. The way the blocks are being ordered doesn't look to mark the expensive fallback as "cold" which might negatively impact things, particularly for subsequent executions of the code.

@kunalspathak
Copy link
Member Author

Failures are known issues.

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple hopefully minor questions/comments

src/coreclr/jit/helperexpansion.cpp Show resolved Hide resolved
src/coreclr/jit/helperexpansion.cpp Show resolved Hide resolved
src/coreclr/jit/helperexpansion.cpp Outdated Show resolved Hide resolved
@EgorBo
Copy link
Member

EgorBo commented Apr 10, 2023

LGTM with a couple of nits

@kunalspathak
Copy link
Member Author

Other failures are unrelated and fixed by #84649

@kunalspathak kunalspathak merged commit 563408a into dotnet:main Apr 11, 2023
@kunalspathak kunalspathak deleted the tls branch April 11, 2023 20:53
@ghost ghost locked as resolved and limited conversation to collaborators May 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.